This is the Optimization Demonstration for GBM-BRT

Approach

Unconstrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Find the minimal combination of algorithm inputs that maximize accuracy. If there are ties, break them by using the point that requires the least data.
  3. Find the costs associated with running the algorithm with those inputs on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes cost and time.
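These four steps can be sketched in R. The model objects (`fit.accuracy`, `fit.time`) and the candidate-grid column names below are placeholders, not the script's actual names:

```r
## Sketch of the unconstrained optimization. Assumes fitted models
## `fit.accuracy` and `fit.time` and a candidate grid with columns
## trainingExamples, covariates, cores, GBMemory, costPerSecond.
grid$accuracy <- predict(fit.accuracy, grid)        ## step 1: predict accuracy
best.acc <- max(grid$accuracy)
ties <- grid[grid$accuracy == best.acc, ]
inputs <- ties[which.min(ties$trainingExamples), ]  ## step 2: tie-break by least data

hw <- grid[grid$trainingExamples == inputs$trainingExamples &
             grid$covariates == inputs$covariates, ]
hw$seconds <- predict(fit.time, hw)                 ## step 3: time on each hardware
hw$cost <- hw$seconds * hw$costPerSecond

## step 4: jointly minimize time and cost (SD-normalized distance to origin)
d <- sqrt((hw$seconds / sd(hw$seconds))^2 + (hw$cost / sd(hw$cost))^2)
optimal <- hw[which.min(d), ]
```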

Data-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the accuracy surface produced above to the amount of data available. The maximizing point will fall in the upper right corner of the subsetted space.
  3. Find the costs associated with running the algorithm with the accuracy-maximizing inputs on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes time and cost.
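A minimal sketch of the data-constrained variant, reusing the hypothetical `grid` and `fit.accuracy` names; `threshold.numTex` is the data ceiling set later in the script:

```r
## Restrict the candidate grid to the available data, then maximize accuracy.
feasible <- grid[grid$trainingExamples <= threshold.numTex, ]
feasible$accuracy <- predict(fit.accuracy, feasible)
inputs <- feasible[which.max(feasible$accuracy), ]  ## upper-right corner of the subset
```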

Cost-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the space produced above to the amount of time and money available for modelling.
  3. Working backwards now, find the accuracies that can be produced in the limited time.
  4. Using the subset of accuracy space, find the combination of algorithm inputs that maximizes accuracy.
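The cost-constrained variant works backwards from the budget; the names are again placeholders, with `threshold.time` and `threshold.cost` defined later in the script:

```r
## Subset to the time/money budget, then maximize accuracy within it.
grid$seconds <- predict(fit.time, grid)
grid$cost <- grid$seconds * grid$costPerSecond
affordable <- grid[grid$seconds <= threshold.time & grid$cost <= threshold.cost, ]
affordable$accuracy <- predict(fit.accuracy, affordable)  ## reachable accuracies
best <- affordable[which.max(affordable$accuracy), ]      ## best model within budget
```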
knitr::opts_chunk$set(cache=TRUE, echo=F, warning=F, error = F, message=F)
knitr::opts_knit$set(root.dir = "/users/scottsfarley/documents")
setwd("/users/scottsfarley/documents")
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
library(akima)
library(ggplot2)
options(java.parameters = "-Xmx1500m")
library(bartMachine)
## Loading required package: rJava
## Loading required package: bartMachineJARs
## Loading required package: car
## Warning: replacing previous import 'lme4::sigma' by 'stats::sigma' when
## loading 'pbkrtest'
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: missForest
## Loading required package: itertools
## Welcome to bartMachine v1.2.3! You have 1.4GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
bartMachine::set_bart_machine_num_cores(3)
## bartMachine now using 3 cores.
library(reshape2)
library(ggdendro)
threshold.time <- 20 ##seconds
threshold.cost <- Inf ##cents
threshold.numTex <- 2500

First, get the training data and fit the model. Perform some skill checks on it.

## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 6 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression ...
## serializing in order to be saved for future R sessions...done

Choose a finite number of possible solutions to the model. Ideally, we would consider every single combination of predictor variables on [0, Inf]; this is obviously intractable, and I only have data for a subset of that space anyway. So randomly sample the subspace in which I have data to make the problem solvable.
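One way to draw such a random sample; the ranges and column names here are illustrative, not the experiment's actual levels:

```r
## Randomly sample candidate configurations within the observed subspace.
set.seed(1)
n <- 1000
candidates <- data.frame(
  cores = sample(1:16, n, replace = TRUE),
  GBMemory = sample(seq(2, 22, by = 2), n, replace = TRUE),
  trainingExamples = sample(seq(100, 10000, by = 100), n, replace = TRUE),
  covariates = sample(1:5, n, replace = TRUE)
)
```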

Using that subset of data and the models we fit previously, predict each candidate configuration of algorithm inputs and hardware variables for execution time and SDM accuracy.
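With bartMachine, posterior-mean predictions for the candidates come from `predict()`; the model names (`bart.time`, `bart.accuracy`) are assumptions:

```r
## Posterior-mean predictions for every candidate configuration.
candidates$seconds <- predict(bart.time, candidates)
candidates$accuracy <- predict(bart.accuracy, candidates)
```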

Plot the posterior means of the accuracy models against the algorithm inputs that should control accuracy. In this case, these are number of training examples and number of covariates.
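A sketch of that surface plot using `akima::interp` to grid the posterior means; the data-frame and column names are assumptions:

```r
## Interpolate posterior-mean accuracy onto a regular grid and plot it.
library(akima)
library(ggplot2)
surf <- with(candidates,
             interp(trainingExamples, covariates, accuracy, duplicate = "mean"))
df <- expand.grid(trainingExamples = surf$x, covariates = surf$y)
df$accuracy <- as.vector(surf$z)  ## column-major flatten matches expand.grid order
ggplot(df, aes(trainingExamples, covariates, fill = accuracy)) +
  geom_raster() +
  labs(x = "Training examples", y = "Covariates", fill = "Accuracy")
```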

The accuracy clearly varies from low (few training examples and few covariates) to very high (many covariates, many training examples). More data might sharpen this surface, but we work with what we have. Our task is to find the combination of inputs that results in the highest-accuracy model. If there’s a tie, find the combination that needs the least data.

Now, we know the combination of algorithm inputs that results in the highest accuracy. The figure below shows the combination identified on the training-examples and covariates axes. This combination of training examples and number of covariates can be run on any combination of hardware, some of which might be suboptimal. Thus, at this point, we’ve solved half of our challenge: the algorithm inputs have been optimized; now it’s time to optimize hardware.

## [1] "Accuracy is maximized at 10000 training examples and 5 predictors."

In theory, the hardware parameters should not affect SDM accuracy. We can test this assumption here by plotting the accuracies obtained for this combination of algorithm inputs against the number of CPUs and amount of memory. If the assumption is valid, the plot should show no change in either the horizontal or vertical direction. We see that there is, in fact, some change. This is likely due to the experimental design and the lack of a full factorial setup. The effect is relatively minor, so I note it and move along.

## [1] "Accuracy Range on Hardware:  0.0150100805696748"
## [1] "Accuracy Range from Expectation:  0"
## [1] "------"
## [1] "Fixing accuracy at:  0.810161563759954"

Now, fix the algorithm inputs at the accuracy-maximizing point, effectively fixing the expected model accuracy. An algorithm with these inputs can be run on any combination of hardware. Project how long that specific model would take, and how much it would cost, on all computing types. Plot those out on time vs. cost axes.
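With the inputs fixed, the projection reduces to predicting over the hardware axes alone; a sketch with the same assumed names:

```r
## Fix algorithm inputs at the accuracy-maximizing point; predict time and
## cost for every hardware combination and plot them against each other.
hw <- candidates[candidates$trainingExamples == 10000 & candidates$covariates == 5, ]
hw$seconds <- predict(bart.time, hw)
hw$cost <- hw$seconds * hw$costPerSecond
ggplot(hw, aes(seconds, cost)) +
  geom_point() +
  labs(x = "Predicted time (seconds)", y = "Predicted cost (cents)")
```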

The optimal solution is the one that balances time and cost equally during the minimization. We use Euclidean distance, normalizing each dimension by its standard deviation so the two are weighted equally. For each candidate combination of hardware, we calculate the distance between it and the origin of these two axes. We then find the minimum of those distances and call that point the optimal.
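The distance computation itself is short once each axis is scaled by its standard deviation; a sketch, assuming the `hw` frame of fixed-input candidates:

```r
## Equal-weight distance from the origin on the time/cost axes.
hw$distance <- sqrt((hw$seconds / sd(hw$seconds))^2 + (hw$cost / sd(hw$cost))^2)
optimal <- hw[which.min(hw$distance), ]
```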

Our job is complete. We’ve now optimized both the hardware and software dimensions of the problem.

## [1] "------GBM-BRT OPTIMAL--------"
## [1] "Predicted Optimal Accuracy 0.810161563759954 +/- 0"
## [1] "Predicted Optimal Cost (seconds) 2454.45778774416"
## [1] "Predicted Optimal Cost (cents) 147.562002199179"
## [1] "Cores:  2"
## [1] "Memory: 2"
## [1] "Training Examples: 10000"
## [1] "Covariates: 5"
## [1] "Distance from origin is:  2458.88950063052"

Everything up to this point was done using the mean of the posterior distribution, a choice which simplifies the process but causes some information loss and may cause over-confidence in the predictions. We can modify our steps to include information from the entire posterior, which may solve this issue.

Instead of projecting just the mean time and mean cost for use in the distance minimization, use the entire set of posterior samples. Calculate the distance metric for each sample in the posterior independently. We’re then left with a distribution of distances, from which we can infer the minimum value.
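In bartMachine the posterior draws are available via `bart_machine_get_posterior`; a sketch, with the model name, predictor columns, and cost-per-second column as assumptions:

```r
## Per-sample distances: one distance per posterior draw per candidate.
post <- bart_machine_get_posterior(bart.time,
          hw[, c("cores", "GBMemory", "trainingExamples", "covariates")])
time.samples <- post$y_hat_posterior_samples     ## rows: candidates, cols: draws
cost.samples <- time.samples * hw$costPerSecond  ## cost is linear in time
d.samples <- sqrt((time.samples / sd(time.samples))^2 +
                  (cost.samples / sd(cost.samples))^2)
hw$distance.mean <- rowMeans(d.samples)
hw$distance.sd <- apply(d.samples, 1, sd)
```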

The posteriors are in a line, since there’s a fixed linear relationship between time and cost.

Now, find the distance metrics for all of those points.

There’s a lot of overlap in this figure, and many points are far away from the optimal. We don’t care about those. Take the few closest to the minimum and look at their distributions.
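Selecting the nearest candidates and melting their posterior distances into long format for density plots might look like this, assuming the `hw` frame and a per-sample distance matrix `d.samples` from the previous step:

```r
## Keep the candidates with the smallest mean distances and plot the
## posterior distribution of distance for each one.
library(reshape2)
near <- order(hw$distance.mean)[1:10]
long <- melt(t(d.samples[near, ]))  ## Var2 indexes the candidate
ggplot(long, aes(x = value, colour = factor(Var2))) +
  geom_density() +
  labs(x = "Distance from origin", colour = "Candidate")
```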

Now, the optimal configuration may be one of the following:

config cores GBMemory seconds cost distance.mean distance.sd
13 2 2 2454.458 147.56200 3082.122 2043.470
1 1 2 2899.923 87.17167 3589.252 2501.722
97 9 2 3095.936 837.57449 4250.550 3487.269
109 10 2 3095.936 930.63832 4284.414 3515.052
73 7 2 3248.501 683.54951 4402.469 3618.003
85 8 2 3248.501 781.19944 4430.947 3641.407
2 1 4 3681.212 184.42873 4464.996 2885.752
3 1 5 3684.666 221.52210 4472.791 2891.759
4 1 6 3684.666 258.44244 4475.698 2893.639
5 1 8 3656.366 329.73107 4490.874 3012.681
49 5 2 3324.383 499.65474 4504.398 3790.162
6 1 10 3675.279 405.08926 4519.489 3012.651
61 6 2 3324.383 599.58568 4526.237 3808.538
7 1 12 3675.279 478.74186 4530.236 3019.815
8 1 14 3675.535 552.43293 4543.131 3028.379
9 1 16 3719.173 633.52394 4609.441 3071.496
10 1 18 3883.038 739.25278 4834.776 3231.026
11 1 20 3883.293 817.12249 4853.872 3243.776
12 1 22 3882.551 894.77279 4873.710 3258.007
25 3 2 3646.952 328.88213 4944.136 4264.735
37 4 2 3646.952 438.50950 4959.622 4278.093
121 11 2 3791.996 1253.86137 5297.177 4306.595
133 12 2 3791.996 1367.84877 5346.567 4346.748
145 13 2 3806.300 1487.42583 5411.119 4387.689
157 14 2 3806.300 1601.84320 5468.082 4433.879

config cores GBMemory seconds cost distance.mean distance.sd cluster
13 2 2 2454.458 147.562 3082.122 2043.47 1

In the results above, you’re actually seeing the trade-off between time and money play out quite nicely. Adding cores costs money but, in the case of GBM-BRTs, reduces time. Here, that trade-off almost exactly evens out.

Data Constraint

In this case, we’ve got a constraint on the amount of data available to us.

## [1] "Current data threshold is  2500"
## [1] "Accuracy is maximized at 2000 training examples and 5 predictors."
## [1] "Expected Max Accuracy is  0.797612643098399"

Cost Constraint II

config cores GBMemory seconds cost distance.mean distance.sd
25 3 2 19.90046 1.794624 2.643884 0.3724785
37 4 2 19.90046 2.392831 2.652165 0.3736452
49 5 2 19.89138 2.989675 2.683208 0.3635854
61 6 2 19.89138 3.587609 2.696217 0.3653481
73 7 2 19.95624 4.199192 2.702726 0.3606120
85 8 2 19.95624 4.799076 2.720210 0.3629447
121 11 2 19.94062 6.593566 2.785321 0.3757077
133 12 2 19.94062 7.192982 2.811290 0.3792107
10 1 18 19.98821 3.805355 2.813298 0.1312125
145 13 2 19.89891 7.776095 2.843459 0.3851853
157 14 2 19.89891 8.374256 2.873392 0.3892403
169 15 2 19.89891 8.972418 2.905200 0.3935490
181 16 2 19.89891 9.570579 2.938820 0.3981033
193 17 2 19.89891 10.168740 2.974192 0.4028949
97 9 2 19.92188 5.389666 3.200432 0.2801042

config cores GBMemory seconds cost distance.mean distance.sd cluster
25 3 2 19.90046 1.794624 2.643884 0.3724785 1
37 4 2 19.90046 2.392831 2.652165 0.3736452 1
49 5 2 19.89138 2.989675 2.683208 0.3635854 1
61 6 2 19.89138 3.587609 2.696217 0.3653481 1